Scikit-Learn Pipelines
Reading time: ~45 minutes | Level: Intermediate-Advanced
The Data Leakage Bug
You trained a fraud detection model. Validation AUC was 0.97. In production, AUC dropped to 0.72. The model looked perfect in evaluation, then immediately degraded.
The culprit was a single line written during data preparation:
from sklearn.preprocessing import StandardScaler
import numpy as np
# WRONG -- this is data leakage
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(X) # fitted on ALL data including the test set
X_train_scaled = X_all_scaled[train_idx]
X_test_scaled = X_all_scaled[test_idx]
# The scaler has seen the test set mean and variance.
# The model's validation metric is now optimistic -- it's been trained
# on information from the future (test set statistics).
The fix is mechanical when you use Pipelines:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# The Pipeline fits the scaler ONLY on training data during cross-validation.
# The scaler is applied (not re-fitted) when transforming the test fold.
pipe = Pipeline([
("scaler", StandardScaler()),
("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train) # scaler.fit_transform(X_train), clf.fit(X_train_scaled, y_train)
pipe.predict(X_test) # scaler.transform(X_test), clf.predict(X_test_scaled)
This lesson is about making Pipelines a natural first instinct -- not an afterthought.
Why This Matters
Pipelines are not a convenience feature. They are the mechanism by which preprocessing becomes part of the model rather than a separate, error-prone script. Without a Pipeline:
- Every preprocessing step must be repeated manually at inference time (and often diverges from training)
- Cross-validation leaks test-fold statistics into transformers fitted on the full training set
- Hyperparameter search across preprocessing choices requires manual bookkeeping
- Serialising the model for deployment requires serialising multiple separate objects
With a Pipeline, you get a single object that is correct-by-construction, safe to cross-validate, easy to deploy, and trivial to serialise.
1. Pipeline Internals: The fit/transform Protocol
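Every step in a Pipeline except the last must be a transformer: it must implement fit and transform. A separate fit_transform is optional; the Pipeline uses it when a step provides one and otherwise falls back to fit followed by transform. The last step only needs fit; whichever of predict, predict_proba, transform, or score it implements becomes available on the Pipeline itself.
The key insight: intermediate steps are fitted only while the Pipeline is being fitted (pipe.fit(), pipe.fit_transform()). During pipe.predict() and pipe.transform(), only transform() is called on intermediate steps -- the fitted state (learned mean, variance, components, etc.) is frozen.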
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
import numpy as np
pipe = Pipeline([
("scaler", StandardScaler()),
("pca", PCA(n_components=10, random_state=42)),
("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
# Access intermediate steps by name
print(pipe.named_steps["scaler"]) # StandardScaler()
print(pipe["scaler"]) # same, using dict-like access (sklearn >= 0.21)
# After fitting, inspect the fitted state of any step
X_dummy = np.random.randn(100, 20)
y_dummy = np.random.randint(0, 2, 100)
pipe.fit(X_dummy, y_dummy)
print(pipe["scaler"].mean_[:5]) # per-feature means, fitted on X_dummy
print(pipe["pca"].explained_variance_ratio_[:5])
# Access transformed output at any intermediate stage
X_after_scaler = pipe["scaler"].transform(X_dummy)
X_after_pca = pipe[:-1].transform(X_dummy) # all steps except the last
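The call protocol described above can be checked directly. The sketch below uses a hypothetical pass-through transformer, CallRecorder (not part of sklearn), that logs which of its methods the Pipeline invokes. It assumes the default fit_transform supplied by TransformerMixin, which calls fit and then transform.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class CallRecorder(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through transformer that logs the methods called on it."""

    def fit(self, X, y=None):
        # The log is fitted state: it is reset every time fit runs.
        self.calls_ = ["fit"]
        return self

    def transform(self, X):
        self.calls_.append("transform")
        return X


rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
y_train = np.tile([0, 1], 20)
X_test = rng.normal(size=(10, 3))

pipe = Pipeline([("spy", CallRecorder()), ("clf", LogisticRegression())])

pipe.fit(X_train, y_train)
calls_after_fit = list(pipe["spy"].calls_)
print(calls_after_fit)      # ['fit', 'transform']

pipe.predict(X_test)
calls_after_predict = list(pipe["spy"].calls_)
print(calls_after_predict)  # ['fit', 'transform', 'transform']
```

Prediction adds a second transform call but no second fit, which is exactly the frozen-state behaviour the section describes.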
2. ColumnTransformer: Heterogeneous Data
Real ML datasets rarely have uniform feature types. You have numerical features that need scaling, categorical features that need encoding, and text features that need vectorisation. ColumnTransformer applies different transformers to different column subsets in parallel, then horizontally concatenates the results.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# Realistic tabular dataset
df = pd.DataFrame({
"age": [25, 32, np.nan, 41, 28],
"income": [40000, 75000, 60000, np.nan, 55000],
"gender": ["M", "F", "F", "M", "M"],
"education": ["Bachelor", "Master", "PhD", "Bachelor", "Master"],
"city": ["NYC", "LA", "NYC", "Chicago", "LA"],
"default": [0, 0, 1, 0, 1],
})
X = df.drop(columns="default")
y = df["default"].values
# Define column groups by type
numerical_cols = ["age", "income"]
ordinal_cols = ["education"]
nominal_cols = ["gender", "city"]
# Sub-pipelines for each group
numerical_pipe = Pipeline([
("imputer", SimpleImputer(strategy="median")), # robust to outliers
("scaler", StandardScaler()),
])
ordinal_pipe = Pipeline([
("imputer", SimpleImputer(strategy="most_frequent")),
("encoder", OrdinalEncoder(
categories=[["Bachelor", "Master", "PhD"]], # explicit order matters
handle_unknown="use_encoded_value",
unknown_value=-1,
)),
])
nominal_pipe = Pipeline([
("imputer", SimpleImputer(strategy="most_frequent")),
("encoder", OneHotEncoder(
handle_unknown="ignore", # silently ignore unseen categories at inference
sparse_output=False, # return dense array (sklearn >= 1.2)
)),
])
# Combine into a ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
("num", numerical_pipe, numerical_cols),
("ord", ordinal_pipe, ordinal_cols),
("nom", nominal_pipe, nominal_cols),
],
remainder="drop", # drop any columns not listed above
verbose_feature_names_out=False, # cleaner feature names
)
# The full Pipeline
from sklearn.linear_model import LogisticRegression
full_pipe = Pipeline([
("preprocessor", preprocessor),
("clf", LogisticRegression(max_iter=1000, random_state=42)),
])
full_pipe.fit(X, y)
# Inspect feature names out of the preprocessor
feature_names = full_pipe["preprocessor"].get_feature_names_out()
print(feature_names)
# ['age', 'income', 'education', 'gender_F', 'gender_M', 'city_Chicago', 'city_LA', 'city_NYC']
Why remainder="drop"? Explicitly listing all columns forces you to think about every feature. The alternative remainder="passthrough" silently passes unlisted columns through unchanged -- dangerous if a column leaks target information.
3. Custom Transformers
The real power of the Pipeline abstraction is that any class implementing fit and transform can be a step. Sklearn provides BaseEstimator and TransformerMixin as mix-ins: BaseEstimator gives you get_params and set_params, and TransformerMixin gives you a default fit_transform that calls fit and then transform.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

# Mixins are listed before BaseEstimator, the order used in sklearn's developer guide.
class LogTransformer(TransformerMixin, BaseEstimator):
    """
    Applies a signed log1p, sign(x) * log1p(|x|), to skewed numerical features.
    Why not just use FunctionTransformer?
    Because this class supports feature-name-aware output via get_feature_names_out,
    stores the number of features it was fitted on (for validation at transform time),
    and can be serialised and inspected like any sklearn estimator.
    """
    def __init__(self, add_original: bool = False) -> None:
        # All hyperparameters must be stored as attributes with the same name.
        # BaseEstimator.get_params() uses inspect to discover them from __init__.
        self.add_original = add_original

    def fit(self, X, y=None) -> "LogTransformer":
        # check_array is sklearn's public validation helper (shape, dtype, NaN checks).
        # The private BaseEstimator._validate_data is avoided: it is not stable across versions.
        X = check_array(X)
        # Store the number of features seen during fit for validation at transform time
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X) -> np.ndarray:
        check_is_fitted(self, "n_features_in_")
        X = check_array(X)
        # Check that X has the same number of features as during fit
        if X.shape[1] != self.n_features_in_:
            raise ValueError(
                f"X has {X.shape[1]} features, but LogTransformer was fitted with {self.n_features_in_}."
            )
        log_X = np.sign(X) * np.log1p(np.abs(X))  # signed log handles negative values
        if self.add_original:
            # Append original features as additional columns
            return np.hstack([log_X, X])
        return log_X

    def get_feature_names_out(self, input_features=None) -> np.ndarray:
        names = [f"log_{f}" for f in self._get_feature_names(input_features)]
        if self.add_original:
            names += self._get_feature_names(input_features)
        return np.array(names, dtype=object)

    def _get_feature_names(self, input_features) -> list[str]:
        if input_features is not None:
            return list(input_features)
        return [f"x{i}" for i in range(self.n_features_in_)]


class WinsorisationTransformer(TransformerMixin, BaseEstimator):
    """
    Clips features to [lower_quantile, upper_quantile] of the TRAINING distribution.
    This is the correct way to handle outliers: the clip bounds are learnt from
    training data only, then applied to test data without re-fitting.
    """
    def __init__(self, lower: float = 0.01, upper: float = 0.99) -> None:
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None) -> "WinsorisationTransformer":
        X = check_array(X)
        self.n_features_in_ = X.shape[1]
        # Compute bounds per feature from training data
        self.lower_bounds_ = np.quantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X) -> np.ndarray:
        check_is_fitted(self, ["lower_bounds_", "upper_bounds_"])
        X = check_array(X)
        # Clip using bounds learnt at fit time -- not re-computed from test data
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)
# Compose custom transformers in a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
rng = np.random.default_rng(0)
X = rng.lognormal(0, 1, size=(500, 6)) # right-skewed features
y = rng.integers(0, 2, size=500)
pipe = Pipeline([
("winsorise", WinsorisationTransformer(lower=0.02, upper=0.98)),
("log", LogTransformer(add_original=False)),
("clf", GradientBoostingClassifier(random_state=42)),
])
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
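The rule that hyperparameters are stored unmodified under their __init__ names is what lets clone() and set_params(), and therefore GridSearchCV, work on custom steps. The following minimal sketch illustrates that contract with a hypothetical Clipper class written only for this demonstration; it skips input validation for brevity.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class Clipper(TransformerMixin, BaseEstimator):
    """Hypothetical minimal transformer used only to show the parameter contract."""

    def __init__(self, lower=0.01, upper=0.99):
        # Store arguments untouched; any derived values belong in fit().
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.lower_bounds_ = np.quantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


fitted = Clipper(lower=0.05).fit(np.arange(100.0).reshape(-1, 1))

params = fitted.get_params()   # discovered from the __init__ signature
candidate = clone(fitted)      # same hyperparameters, no fitted state
candidate.set_params(upper=0.9)

print(params)                                # {'lower': 0.05, 'upper': 0.99}
print(hasattr(candidate, "lower_bounds_"))   # False
```

A search does this for every candidate: clone the estimator, set the candidate parameters, then fit on the training fold.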
4. FeatureUnion: Parallel Feature Extraction
FeatureUnion applies multiple transformers in parallel and concatenates their outputs. The classic use case is combining TF-IDF features from text with handcrafted numeric features.
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
class TextSelector(BaseEstimator, TransformerMixin):
    """Selects a single text column from a DataFrame."""
    def __init__(self, key: str) -> None:
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key].fillna("")


class NumericSelector(BaseEstimator, TransformerMixin):
    """Selects numeric columns from a DataFrame."""
    def __init__(self, keys: list[str]) -> None:
        self.keys = keys

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.keys].fillna(0).values
# Text branch: extract TF-IDF features, then reduce with SVD (LSA)
text_branch = Pipeline([
("selector", TextSelector(key="review_text")),
("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
("svd", TruncatedSVD(n_components=50, random_state=42)), # dense 50-d
("scaler", StandardScaler()),
])
# Numeric branch: scale structured features
numeric_branch = Pipeline([
("selector", NumericSelector(keys=["word_count", "rating", "verified_purchase"])),
("scaler", StandardScaler()),
])
# Combine both branches in parallel
combined_features = FeatureUnion([
("text", text_branch),
("numeric", numeric_branch),
])
# Full Pipeline
from sklearn.linear_model import LogisticRegression
full_pipe = Pipeline([
("features", combined_features),
("clf", LogisticRegression(C=1.0, max_iter=1000)),
])
# The Pipeline handles the DataFrame -> features -> predictions chain correctly,
# with no manual array management.
FeatureUnion vs ColumnTransformer: prefer ColumnTransformer for tabular data (it handles DataFrames natively and is more explicit about column selection). Use FeatureUnion when you need truly parallel pipelines with heterogeneous input types (e.g., combining image and text features).
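For comparison, a text-plus-numeric model can also be expressed with ColumnTransformer alone, without selector classes. The sketch below is one possible arrangement on a tiny made-up review frame (the data and labels are illustrative only). The detail that matters is that the text column is given as a bare string rather than a list, so the vectoriser receives a 1-D column of documents.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative toy data, not from the lesson's dataset
reviews = pd.DataFrame({
    "review_text": [
        "great product works well",
        "terrible broke quickly",
        "works fine good value",
        "bad quality very poor",
    ],
    "word_count": [4, 3, 4, 4],
    "rating": [5, 1, 4, 2],
})
labels = [1, 0, 1, 0]

features = ColumnTransformer([
    # Bare string -> 1-D input, which is what text vectorisers expect
    ("text", TfidfVectorizer(), "review_text"),
    # List of names -> 2-D input, which is what scalers expect
    ("numeric", StandardScaler(), ["word_count", "rating"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(reviews, labels)

n_features = model["features"].transform(reviews).shape[1]
vocab_size = len(model["features"].named_transformers_["text"].vocabulary_)
predictions = model.predict(reviews)
print(n_features, vocab_size)
```

The output width is the TF-IDF vocabulary size plus the two scaled numeric columns.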
5. Pipeline + GridSearchCV: Correct Hyperparameter Search
The Pipeline's most important property: it can be passed directly to GridSearchCV or RandomizedSearchCV, and sklearn ensures the transformer is fitted only on the training fold, not the validation fold.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import numpy as np
pipe = Pipeline([
("scaler", StandardScaler()),
("pca", PCA(random_state=42)),
("svm", SVC(probability=True)),
])
# Hyperparameter grid: use double underscore to target steps by name
param_grid = {
"pca__n_components": [5, 10, 20], # PCA hyperparameter
"svm__C": [0.1, 1.0, 10.0], # SVM regularisation
"svm__kernel": ["rbf", "linear"], # SVM kernel
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
pipe,
param_grid,
cv=cv,
scoring="roc_auc",
n_jobs=-1,
verbose=1,
refit=True, # after search, refit best config on entire training set
)
# Generate toy data
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 25))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
search.fit(X, y)
print(f"Best params: {search.best_params_}")
print(f"Best CV AUC: {search.best_score_:.4f}")
# search.best_estimator_ is a full Pipeline, ready for inference
best_pipe = search.best_estimator_
proba = best_pipe.predict_proba(X[:5])
What safe CV with Pipelines actually prevents: if you scale the data manually and then run a plain GridSearchCV on the result, StandardScaler.fit() has already been called once on all the data, so every validation fold has contributed to the mean and variance used to scale its own training folds. That leaks information. When the scaler is inside the Pipeline, each CV split fits a fresh clone: sklearn calls scaler.fit_transform(X_train_fold) and scaler.transform(X_val_fold) -- the validation fold never influences the scaler.
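The size of the bias depends on the transformer. With StandardScaler it is usually small, so the sketch below swaps in supervised feature selection (SelectKBest), where the same mistake is easy to see. The features are pure noise and the labels are independent of them, so an honest estimate should sit near 0.5 AUC. The sample sizes and k are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features
y = np.tile([0, 1], 50)           # balanced labels, independent of X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the selector sees every label, including those of future validation folds
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=cv, scoring="roc_auc"
).mean()

# Honest: the selector lives in the Pipeline and is refitted on each training fold
honest_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_auc = cross_val_score(honest_pipe, X, y, cv=cv, scoring="roc_auc").mean()

print(f"leaky CV AUC:  {leaky_auc:.2f}")
print(f"honest CV AUC: {honest_auc:.2f}")
```

The only difference between the two estimates is where the transformer is fitted; the model and the folds are identical.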
6. The set_output API (sklearn >= 1.2)
Before sklearn 1.2, Pipeline transformers always returned numpy arrays (or sparse matrices), so feature names had to be recovered separately through get_feature_names_out. The set_output API lets transformers return DataFrames, preserving column names end-to-end.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np
df = pd.DataFrame({
"age": [25, 32, 41, 28, 35],
"income": [40000.0, 75000.0, 60000.0, 55000.0, 68000.0],
"gender": ["M", "F", "F", "M", "F"],
})
preprocessor = ColumnTransformer([
("num", StandardScaler(), ["age", "income"]),
("cat", OneHotEncoder(sparse_output=False), ["gender"]),
])
# Enable DataFrame output globally for this transformer chain
preprocessor.set_output(transform="pandas")
X_transformed = preprocessor.fit_transform(df)
print(type(X_transformed)) # pandas.core.frame.DataFrame
print(X_transformed.columns.tolist())
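# ['num__age', 'num__income', 'cat__gender_F', 'cat__gender_M']
# (pass verbose_feature_names_out=False to ColumnTransformer to drop the "num__"/"cat__" prefixes)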
This makes debugging transformations far easier: pipe[:-1].transform(X) returns a labelled DataFrame instead of an anonymous array.
7. Pipeline Persistence with joblib
A Pipeline is a single serialisable object. Save it once; load it in any environment with matching library versions.
import joblib
from pathlib import Path
from sklearn.pipeline import Pipeline
def save_pipeline(pipe: Pipeline, path: str | Path) -> None:
    """
    Persists a fitted Pipeline using joblib.
    joblib is preferred over pickle for sklearn objects because it stores
    large numpy arrays efficiently. Memory-mapped loading (mmap_mode) only
    works for uncompressed files, so compress=3 trades that away for size.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipe, path, compress=3)  # compress=3 is a good size/speed tradeoff
    print(f"Pipeline saved to {path} ({path.stat().st_size / 1024:.1f} KB)")


def load_pipeline(path: str | Path) -> Pipeline:
    """Loads a fitted Pipeline from disk. joblib uses pickle underneath, so only load files you trust."""
    pipe = joblib.load(path)
    print(f"Loaded pipeline: {pipe.steps}")
    return pipe
# --- Training side ---
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = rng.integers(0, 2, size=200)
pipe = Pipeline([
("scaler", StandardScaler()),
("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)
save_pipeline(pipe, "models/baseline_v1.joblib")
# --- Inference side (different process / server) ---
loaded_pipe = load_pipeline("models/baseline_v1.joblib")
X_new = rng.normal(size=(5, 10))
predictions = loaded_pipe.predict(X_new)
print(predictions)
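Version pinning: a loaded Pipeline runs on whatever sklearn version is installed in the loading process, while its pickled state reflects the version it was saved with. Always record the sklearn version alongside the saved model file and pin the same version at inference. A mismatch can fail at load time or, worse, silently produce wrong predictions if internals changed between versions; sklearn 1.3 and later emit an InconsistentVersionWarning on unpickling in this case.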
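One lightweight way to record the version is a JSON sidecar file written next to the model. The helpers below, save_with_metadata and load_with_check, are hypothetical names for this sketch, and the sidecar layout is an assumption rather than an sklearn convention.

```python
import json
import tempfile
import warnings
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def save_with_metadata(pipe, model_path):
    """Hypothetical helper: dump the pipeline plus a JSON sidecar with the sklearn version."""
    model_path = Path(model_path)
    model_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipe, model_path)
    meta = {"sklearn_version": sklearn.__version__}
    model_path.with_suffix(".json").write_text(json.dumps(meta))


def load_with_check(model_path):
    """Hypothetical helper: warn if the recorded version differs from the installed one."""
    model_path = Path(model_path)
    meta = json.loads(model_path.with_suffix(".json").read_text())
    if meta["sklearn_version"] != sklearn.__version__:
        warnings.warn(
            f"Model saved with sklearn {meta['sklearn_version']}, "
            f"running {sklearn.__version__}"
        )
    return joblib.load(model_path), meta


rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = np.tile([0, 1], 100)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "models" / "baseline_v1.joblib"
    save_with_metadata(pipe, path)
    loaded_pipe, loaded_meta = load_with_check(path)

same_predictions = bool(
    np.array_equal(pipe.predict(X_train), loaded_pipe.predict(X_train))
)
print(loaded_meta, same_predictions)
```

Whether a mismatch should warn or raise is a deployment decision; raising is the stricter choice for production services.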
8. Custom Estimator for Business Logic
Sometimes the final step in a Pipeline is not a standard sklearn estimator. You need a class that wraps business logic: threshold adjustment, post-processing, or an external model.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_is_fitted, check_X_y

# Mixins are listed before BaseEstimator, the order used in sklearn's developer guide.
class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """
    Wraps any probability-outputting binary classifier and applies a custom
    decision threshold instead of the default 0.5.
    Use case: in fraud detection you may want threshold=0.2 to maximise
    recall at the cost of precision -- this is a business decision, not
    a statistical one, and it belongs in the Pipeline.
    (sklearn >= 1.5 also ships FixedThresholdClassifier and
    TunedThresholdClassifierCV in sklearn.model_selection for this purpose.)
    """
    def __init__(self, base_clf, threshold: float = 0.5) -> None:
        self.base_clf = base_clf
        self.threshold = threshold

    def fit(self, X: np.ndarray, y: np.ndarray) -> "ThresholdClassifier":
        X, y = check_X_y(X, y)
        # Fit a clone so the estimator passed to __init__ is never mutated
        self.base_clf_ = clone(self.base_clf).fit(X, y)
        self.classes_ = self.base_clf_.classes_
        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        check_is_fitted(self)
        return self.base_clf_.predict_proba(X)

    def predict(self, X: np.ndarray) -> np.ndarray:
        check_is_fitted(self)
        proba = self.predict_proba(X)[:, 1]  # probability of classes_[1], the positive class
        # Map the thresholded decision back to the original class labels
        return self.classes_[(proba >= self.threshold).astype(int)]

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        """Default score is accuracy at the custom threshold."""
        return (self.predict(X) == y).mean()
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
("scaler", StandardScaler()),
("clf", ThresholdClassifier(
base_clf=LogisticRegression(max_iter=1000),
threshold=0.25, # capture more fraud at the cost of false positives
)),
])
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = rng.binomial(1, 0.15, size=300) # 15% positive (fraud-like imbalance)
pipe.fit(X, y)
preds = pipe.predict(X[:5])
print(preds)
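# The threshold is a Pipeline hyperparameter -- searchable with GridSearchCV.
# Caveat: recall never decreases as the threshold is lowered, so scoring="recall"
# alone tends to select the smallest candidate. In practice, prefer a metric that
# also penalises false positives.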
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(
pipe,
param_grid={"clf__threshold": [0.2, 0.25, 0.3, 0.4]},
scoring="recall", # optimise for catching fraud
cv=5,
)
search.fit(X, y)
print(f"Best threshold: {search.best_params_['clf__threshold']}")
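Because the threshold does not change the fitted model, it can also be chosen from out-of-fold probabilities without refitting once per candidate. The sketch below takes that route. The synthetic data from make_classification and the use of the F2 score (which weights recall above precision) as the selection metric are both assumptions for illustration; the right metric is a business choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 15% positives), for illustration only
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Out-of-fold probabilities: each row is scored by a model that never saw it
oof_proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

thresholds = [0.2, 0.25, 0.3, 0.4]
recalls = []
f2_scores = []
for t in thresholds:
    y_pred = (oof_proba >= t).astype(int)
    recalls.append(recall_score(y, y_pred))
    f2_scores.append(fbeta_score(y, y_pred, beta=2))

best_by_f2 = thresholds[int(np.argmax(f2_scores))]

for t, r, f2 in zip(thresholds, recalls, f2_scores):
    print(f"threshold={t:.2f}  recall={r:.3f}  F2={f2:.3f}")
print(f"Best threshold by F2: {best_by_f2}")
```

Recall can only fall or stay flat as the threshold rises, which is why a recall-only search is uninformative; F2 forces a trade-off against false positives.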
9. Production Patterns
Pattern 1: Train/Test Split OUTSIDE the Pipeline
from sklearn.model_selection import train_test_split
# Always split BEFORE building the pipeline.
# The pipeline itself handles fit/transform separation during CV.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
Pattern 2: Feature Names Through the Full Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
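import pandas as pd
# After fitting the full Pipeline, retrieve feature names from the Pipeline itself,
# so they come from the same fitted preprocessor that fed the model.
# Assumes pipe ends in a tree-based "clf" step and has already been fitted.
output_names = pipe[:-1].get_feature_names_out()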
# Map these back to tree model feature importances
importances = pipe["clf"].feature_importances_
importance_df = pd.DataFrame({
"feature": output_names,
"importance": importances,
}).sort_values("importance", ascending=False)
Pattern 3: Pipeline Cloning for Safe Experiment Tracking
from sklearn.base import clone
base_pipe = Pipeline([
("scaler", StandardScaler()),
("clf", LogisticRegression()),
])
# clone() creates an unfitted copy with the same hyperparameters.
# Use this to run experiments from a shared baseline without mutation.
experiment_pipe = clone(base_pipe)
experiment_pipe.set_params(clf__C=10.0)
experiment_pipe.fit(X_train, y_train)
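A quick check of what clone() does and does not copy, as a self-contained sketch on random data: the clone carries the hyperparameters but none of the fitted state, and changing it leaves the baseline untouched.

```python
import numpy as np
from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_train = np.tile([0, 1], 50)

base_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
base_pipe.fit(X_train, y_train)

experiment_pipe = clone(base_pipe)
experiment_pipe.set_params(clf__C=10.0)

base_C = base_pipe.get_params()["clf__C"]
experiment_C = experiment_pipe.get_params()["clf__C"]

# The clone has no learned coefficients until it is fitted itself
try:
    check_is_fitted(experiment_pipe["clf"])
    clone_is_fitted = True
except NotFittedError:
    clone_is_fitted = False

shares_steps = experiment_pipe["clf"] is base_pipe["clf"]
print(base_C, experiment_C, clone_is_fitted, shares_steps)  # 1.0 10.0 False False
```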
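Pattern 4: Swapping a Step and Refitting
# If new categories appear in production, you may want a different encoder.
# set_params replaces a step in place without rebuilding the Pipeline.
# (This assumes a ColumnTransformer step named "preprocessor" that contains a
# sub-pipeline "cat" with a step "encoder".)
from sklearn.preprocessing import OrdinalEncoder
pipe.set_params(preprocessor__cat__encoder=OrdinalEncoder(
handle_unknown="use_encoded_value",
unknown_value=-1,
))
# pipe.fit() refits EVERY step, not just the encoder, and the downstream steps must
# be refitted anyway because their inputs have changed. Refit on the original
# training data plus the new data (X_combined, y_combined, assumed to exist here);
# fitting on the new data alone would discard everything learned before.
pipe.fit(X_combined, y_combined)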
10. Common Mistakes
Mistake 1: Fitting transformers before the Pipeline
# BAD: scaler is fitted on ALL data before any train/test split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_all) # test data contaminated
X_train = X_scaled[:800]
X_test = X_scaled[800:]
# GOOD: the scaler is inside the Pipeline, fitted only during .fit()
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)
pipe.predict(X_test)
Mistake 2: Using fit_transform at inference time
# BAD: re-fits the scaler on inference data -- different mean/std than training
X_new_scaled = scaler.fit_transform(X_new) # scaler state overwritten!
# GOOD: only transform at inference
X_new_scaled = scaler.transform(X_new) # uses training-time mean/std
# With a Pipeline, this is impossible to get wrong -- pipe.predict() always
# calls transform(), never fit_transform(), on intermediate steps.
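The effect is easy to see numerically. In this sketch the new batch has drifted upwards by about three training standard deviations (the numbers are arbitrary). Transforming with the training-time scaler preserves that shift; re-fitting on the batch erases it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=10.0, size=(1000, 1))
X_new = rng.normal(loc=130.0, scale=10.0, size=(50, 1))  # drifted batch

scaler = StandardScaler().fit(X_train)

# Correct: training-time mean/std, so the drift stays visible to the model
correct = scaler.transform(X_new)

# Wrong: a scaler re-fitted on the batch centres it at zero and hides the drift
wrong = StandardScaler().fit_transform(X_new)

print(f"transform mean:     {correct.mean():.2f}")  # close to 3
print(f"fit_transform mean: {wrong.mean():.2f}")    # close to 0
```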
Mistake 3: Not using make_pipeline for quick prototypes
from sklearn.pipeline import make_pipeline
# make_pipeline infers step names automatically (lowercase class names).
# Use it for quick prototyping; use Pipeline([...]) when you want to choose the step names yourself.
pipe = make_pipeline(StandardScaler(), PCA(10), LogisticRegression())
# Step names: 'standardscaler', 'pca', 'logisticregression'
print(pipe.named_steps.keys())
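One detail to be aware of when prototyping this way: if the same class appears more than once, make_pipeline appends a numeric suffix, and the resulting names contain a hyphen, so they cannot be written as plain keyword arguments to set_params. A short sketch:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    StandardScaler(),
    LogisticRegression(),
)

step_names = list(pipe.named_steps)
print(step_names)
# ['standardscaler-1', 'pca', 'standardscaler-2', 'logisticregression']

# Hyphenated names are not valid identifiers, so pass them via dict unpacking
pipe.set_params(**{"standardscaler-2__with_mean": False})
second_scaler_with_mean = pipe["standardscaler-2"].with_mean
```

The same dict form works for param_grid keys in GridSearchCV, which are plain strings anyway.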
Mistake 4: Forgetting remainder in ColumnTransformer
# If a DataFrame column is NOT listed in any transformer, the default
# remainder="drop" silently drops it. This is often what you want -- but
# remainder="passthrough" silently includes it unchanged, which can leak
# target-correlated columns you forgot to remove.
# Always be explicit about which columns exist and what to do with them.
Key Takeaways
- A Pipeline is the atomic unit of an ML model. Training code, preprocessing, and inference code must be the same object. Never separate them.
- Data leakage in CV is the silent source of optimistic metrics. Any transformer fitted before CV sees test-fold data. A Pipeline with GridSearchCV prevents this by design.
- ColumnTransformer applies different preprocessing to different column types in parallel, then concatenates. Use it for every real-world tabular dataset.
- Custom transformers inherit from BaseEstimator + TransformerMixin. Store all hyperparameters as __init__ parameters with matching attribute names; sklearn's grid search depends on this.
- fit_transform is for training; transform is for inference. A Pipeline enforces this distinction automatically -- you cannot accidentally call fit_transform at inference time.
- Serialise Pipelines with joblib.dump, not raw pickle. Record the sklearn version alongside every saved model.
- The set_output(transform="pandas") API (sklearn >= 1.2) enables DataFrame output throughout the Pipeline, making debugging far easier.
- Business logic (threshold adjustment, score calibration, post-processing) belongs in the Pipeline as a custom estimator step -- not in separate inference scripts.
Practice Problems
Problem 1 -- Leakage Audit
Take an existing preprocessing script that manually applies fit_transform before train/test split. Refactor it into a complete Pipeline. Measure the CV AUC before and after refactoring. The before-refactoring AUC should be higher (optimistically biased) than the post-refactoring AUC. Document the difference.
Problem 2 -- Custom Imputer
Write a GroupMedianImputer that imputes missing numerical values with the median of a groupby column (e.g. fill missing income with the median income for that person's city). The group medians must be computed only from training data. Integrate it into a ColumnTransformer Pipeline.
Problem 3 -- Text + Numeric Pipeline
Build a Pipeline for a sentiment classification task. The input is a DataFrame with a text column and three numeric columns (word_count, avg_word_length, exclamation_count). Use a FeatureUnion to combine TF-IDF (text branch) with StandardScaler (numeric branch). Run a RandomizedSearchCV over TF-IDF max_features, SVD n_components, and LogisticRegression C. Report the best validation AUC and the best hyperparameter combination.
Problem 4 -- Pipeline Versioning
Write a VersionedPipeline that wraps sklearn's Pipeline and adds: (a) a metadata dict storing sklearn version, python version, training date, and dataset hash; (b) a save method that writes the Pipeline and metadata to a directory; (c) a class method load that loads and validates that the sklearn version matches the current environment, raising a warning if it does not.
Problem 5 -- Calibrated Probability Pipeline
Wrap a GradientBoostingClassifier in a CalibratedClassifierCV (from sklearn.calibration) inside a Pipeline, and verify that the output probabilities are well-calibrated using a reliability diagram (fraction of positives vs mean predicted probability, binned). Compare calibration before and after the calibration wrapper.
